Monday, November 27, 2006

Appendix to Optimizer Project Handout

The documentation for the data-parallelization library are posted on the website. Link

The evil enter instruction

From your classmate Zev:

After some experimentation, we found that the enter instruction seems to be broken (using the push, mov, sub equivalent works fine). Some googling turned up the following discussion group thread:

http://groups.google.co.nz/group/comp.os.linux.development.system/browse_thread/thread/a057249198598933/a4f5251c9ef1e7a2?#a4f5251c9ef1e7a2.

Everything is pretty much summed up with Linus' reply at the end. Bottom line: enter can cause segfaults on Linux and is a lot slower than its equivalent.

Thursday, November 16, 2006

72 + 16 = 96 (Explanation for the curious)

Yeah, um, that was me trying to hid some stuff from you and forgetting what I was glossing over. If you are interested:

Some x86-64 instructions are very complex. They are translated into simpler operations in hardware. In recitation we saw that a mov that references memory is translated into multiple simple operations in the underlying hardware, the mov and the memory reference. The mov is dependent upon the memory reference. These smaller instructions are called micro-operations.

There are 16 architectural registers and 72 re-order buffer entries. That is 88 registers. But there are also 8 hidden registers that are used for shuffling values between micro-operations. These are not part of architectural state (the state the asm programmer sees) but they are needed to shuffle values between micro-ops when a dependent micro-op is evicted from the re-order buffer.

Running without benchmarking

The new lib6035.a and the new assemble require that you use the 'benchmark' script to run your code. If don't you want to benchmark, maybe you are just debugging, use the original lib6035.a in /mit/6.035/provided/optimizer/lib and compile with the old command:

gcc4 example.s -L. -l6035 -pthread -o example
cc example.s -L. -l6035 -pthread -o example

Wednesday, November 15, 2006

New benchmarking infrastructure!

With help from our sys-admin (thanks Mike!), I have completed the new, much much more accurate benchmarking infrastructure. You have to make some changes to use it.

First, set your LD_LIBRARY_PATH environment variable to include /u/mgordon/6035/lib64


For bash:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/u/mgordon/6035/lib64

For csh:
setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH:/u/mgordon/6035/lib64

I have placed a new version of the 6035 library in /u/mgordon/6035/lib64, you don't need
to copy it from there.

Now you have to compile things a bit differently, new compile:

gcc4 program.s -pthread -lpapiex -l6035 -L/u/mgordon/6035/lib64

Just in case your are curious, papiex is the library on which I built our benchmarking infrastructure. It allows us to get access to performance counters on the CPU! Very accurate.

----

You can use /u/mgordon/6035/bin/benchmark to get the cycle count information for your program after you have compiled it.

Use 'benchmark program' to get the cycle count for program.

Here is some sample output:

loop
libmonitor debug: (P20071,T0x0) monitor_fini_process()
PapiEx Version: 0.99rc2
Executable: /u/mgordon/6.035/example/loop
Processor: AMD K8 Revision C
Clockrate: 1993.465942
Parent Process ID: 20064
Process ID: 20071
Hostname: tyner
Options: PAPI_TOT_CYC,NO_WRITE
Domain: User
Real usecs: 125
Real cycles: 241852
Proc usecs: 122
Proc cycles: 239896
PAPI_TOT_CYC: 135670


Event descriptions:
Event: PAPI_TOT_CYC
Derived: No
Short Description: Total cycles
Long Description: Total cycles
Developer's Notes:


Start: Wed Nov 15 23:44:20 2006
Finish: Wed Nov 15 23:44:20 2006
libmonitor debug: (P20071,T0x0) monitor_fini_library()

------

What to notice:
*First line give the program you ran (in this case "loop").
*The line your are interested in: PAPI_TOT_CYC: 135670
this is the number of user cycles that your program took to run.


----

You can also define a single "caliper" in the code. This is a section in your code that you would like detailed information about. Use start_caliper() to define the beginning of a section and end_caliper() to define the end of a section. These functions are defined in lib6035.a. So you can use a callout for each in decaf code or just place it in your assembly code (adhering to calling convention of course).

With a caliper defined, the output would look like:
loop
libmonitor debug: (P20167,T0x0) monitor_fini_process()
PapiEx Version: 0.99rc2
Executable: /u/mgordon/6.035/example/loop
Processor: AMD K8 Revision C
Clockrate: 1993.465942
Parent Process ID: 20156
Process ID: 20167
Hostname: tyner
Options: PAPI_TOT_CYC,NO_WRITE
Domain: User
Real usecs: 435
Real cycles: 860899
Proc usecs: 127
Proc cycles: 249792
PAPI_TOT_CYC: 150002

Caliper 1:
Executions: 1
Real usecs: 16
Real cycles: 32333
Proc usecs: 16
Proc cycles: 32328
PAPI_TOT_CYC: 32226 ***This is the cycle count
for your caliper

Event descriptions:
Event: PAPI_TOT_CYC
Derived: No
Short Description: Total cycles
Long Description: Total cycles
Developer's Notes:


Start: Wed Nov 15 23:51:48 2006
Finish: Wed Nov 15 23:51:48 2006
libmonitor debug: (P20167,T0x0) monitor_fini_library()

Let me know if there are any problems. Actually, let me know if it works for you! It is somewhat untested for anyone but me.

Mike

Monday, November 13, 2006

New option needed when assembling!

Although it has not been discussed, the new 6035 static library (lib6035.a) includes the data parallelization library. Even if you are not data parallelizing, you have to compile the library with the -pthread option:

gcc4 example.s -L. -l6035 -pthread -o example
cc example.s -L. -l6035 -pthread -o example

Sunday, November 12, 2006

Quiz 2

Quiz 2 has been graded. You will get your quiz back during Monday's Lecture (Nov. 13). The class average is 77 and the standard deviation is 13.

Monday, November 06, 2006

Quiz 2 Review Notes

I hope the quiz review helped. You can find my typed notes here. They are a bit rough (I hate Word, should have used latex) but they should help you to understand dataflow analysis. If you have any questions, you can email me.

Saturday, November 04, 2006

Quiz 2 Information

The second quiz is schedule for Tuesday during class time.

Remember, it is open book/notes!

I have posted a practice exam and solutions on the web site. You can also look at last year's practice quiz, but be careful with problem 4, there are bugs in it.

I am going to give a quiz review during class on Monday (11am, normal room). I will answer any questions and go over the material (mostly dataflow analysis).

The material:
  • Assembly Language
  • Low-level intermediate representation (3-address code, etc.)
  • Translation from high-level representation to low-level (example, for-loop->assembly)
  • Control Flow Analysis and Basic Blocks
  • Intra-basic-block optimizations (value numbering, etc.)
  • Dataflow Analysis including transfer functions, confluence, Global CSE, copy-prop, dce, liveness, and the theoretical framework (lattices, transfer functions, etc.)

Friday, November 03, 2006

Static Single-Assignment (SSA) Readings

For those interested in SSA form for their compiler, I recommend you first look at Chapter 19 of the Tiger Book (2nd ed.). This is good introduction to SSA. After this chapter you should be able to decide if you want to use SSA.

The Arc Book contains the most thorough coverage of SSA. Chapter 9.3 focuses on conversion to SSA and Chapter 10 discusses transformations in an SSA framework.

The Whale Book and Optimizing Compilers for Modern Architectures include brief descriptions of SSA also, but they are very limited.

Wednesday, November 01, 2006

Extension for Dataflow Optimization Project

Design Document: Friday 11/3 @ 5pm (either email me the doc or place in class inbox)

Implementation and Design: Sunday 11/12 @ 11:59pm